
gbn: Sender side only packeting resend bug fix #87

Conversation


@ViktorTigerstrom ViktorTigerstrom commented Nov 7, 2023

Fixes #89

This PR addresses the packet resend bug through a sender-side-only fix.

This PR ensures that after we have resent the queue, we wait until we know that both parties are in sync before we continue to send new packets. This ensures that we don't end up in an indefinite resend loop due to latency and delayed NACKs from the other party, which could happen prior to this commit.

To understand why we need to wait for the expected ACK/NACK after resending the queue, consider the following scenario:

  1. Alice sends packets 1, 2, 3 & 4 to Bob.
  2. Bob receives packets 1, 2, 3 & 4, and sends back the respective ACKs.
  3. Alice receives the ACKs for packets 1 & 2, but due to latency the ACKs for packets 3 & 4 are delayed and aren't received until Alice's resend timeout has passed, which leads to Alice resending packets 3 & 4. Alice will then receive the delayed ACKs for packets 3 & 4, but will treat them as the ACKs for the resent packets, not the original packets they were actually sent for. If we didn't wait after resending the queue, Alice would then proceed to send more packets (5 & 6).
  4. When Bob receives the resent packets 3 & 4, Bob will respond with NACK 5. Due to latency, the packets 5 & 6 that Alice sent in step (3) above will then be received by Bob, and be processed as the correct response to the NACK 5. Bob will then await packet 7.
  5. Alice will receive the NACK 5 and resend packets 5 & 6. But as Bob is now awaiting packet 7, this send will lead to a NACK 7. And due to latency, if Alice doesn't wait after resending the queue, Alice will proceed to send new packet(s) before receiving the NACK 7.
  6. This resend loop would continue indefinitely, so we need to ensure that Alice waits after she has resent the queue until she is sure that both parties are in sync, before proceeding to send new packets.

To ensure that we are in sync after we have resent the queue, we wait until we EITHER:

  1. Receive a NACK for the sequence number succeeding the last packet in the resent queue, i.e. NACK 5 in step (3) above.
  2. Receive an ACK for the last packet in the resent queue, i.e. ACK 4 in step (3) above. After we receive the expected ACK, we then wait for the duration of the resend timeout before continuing. The reason we wait is that the ACKs we receive after a resend could be delayed ACKs for the original packets we sent, and not ACKs for the resent packets. In step (3) above, the ACKs for packets 3 & 4 that Alice received were delayed ACKs for the original packets. If Alice had immediately continued to send new packets (5 & 6) after receiving ACK 4, she would then have received the NACK 5 from Bob, which was the actual response to the resent queue. But as Alice had already sent packets 5 & 6 when she received the NACK 5, the resend triggered by that NACK would cause the resend loop to continue indefinitely.

When either of the two conditions above is met, we consider both parties to be in sync, and we can proceed to send new packets.

TODO:

  • Update the PR with extra docs + update commit messages with more detailed docs


@ViktorTigerstrom ViktorTigerstrom left a comment


Updated with docs, and made some small changes to the PR! This is now ready for review 🔥.

I'll address CI failures tomorrow, but the PR should be reviewable even before that happens.

I've confirmed locally through manual testing that this PR causes no issues if the two parties have different resend timeouts. But something I'd still like to test is the interoperability if the client has updated to the new LNC version but the server hasn't. It would be good to ensure that this doesn't cause more resends by the server than prior to this PR.


@ellemouton ellemouton left a comment


Nice investigation!!! 🎉

Left a bunch of comments - basically, I think we can simplify and modularise the logic quite a bit by letting the queue handle everything.

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-10-client-side-packeting-bug-fix branch 9 times, most recently from 8f8ecc9 to 09b83a0 Compare November 13, 2023 16:04

@ViktorTigerstrom ViktorTigerstrom left a comment


Thanks a lot for the review @ellemouton!!

I've addressed your feedback, and I totally agree that it makes much more sense to let the queue struct handle the resend catch up logic instead, so I've moved it there!

Please also check my response here, and let me know what you'd like me to do with the catchUpID: #87 (comment)

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-10-client-side-packeting-bug-fix branch from 09b83a0 to dd6046e Compare November 13, 2023 17:54
@ViktorTigerstrom

The itest that failed in this round of the CI workflow is a pre-existing flake from before this PR. It works sometimes and fails sometimes.

@ellemouton

The itest that failed in this round of the CI workflow is a pre-existing flake from before this PR. It works sometimes and fails sometimes.

You are able to trigger a manual re-run right?

Also - can you make an issue for this flake so that some investigation can be done?

@ViktorTigerstrom

I've learned how to trigger a re-run of the workflow now :)! Also opened a new issue for the error. Thanks for the nudge :)!


@ellemouton ellemouton left a comment


check out this patch for how we can do a bit of cleanup: gbn_cleanup.patch

Left a few comments - I'm mainly concerned with adding complexity to the system to solve this issue. I think we should maybe first focus on reducing the complexity and then tackle the issue at hand.

So I'm thinking we should really try to divide up what exactly is being handled by GBN and what is being handled by the queue. I think we can make use of a state-machine model too. Currently it seems like the complexity comes from the fact that both the queue and the GBN can trigger timeouts & resends. I think this should only be handled by one entity. So for example, if we are in the "resend & sync" state, then nothing can trigger a new "resend & sync" state, since we are already in that state.

Let me know what you think.

gbn/gbn_conn.go Outdated
@@ -578,7 +591,10 @@ func (g *GoBackNConn) receivePacketsForever() error { // nolint:gocyclo
}

	case *PacketACK:
-		gotValidACK := g.sendQueue.processACK(m.Seq)
+		gotValidACK := g.sendQueue.processACK(
+			m.Seq, g.resendTimeout,
+		)
Member


If we never change the value of g.resendTimeout, then why do we need to pass it to sendQueue each time we call processACK? It can instead be passed to it on instantiation and be a member of queue, right?


@ViktorTigerstrom ViktorTigerstrom Nov 17, 2023


So the problem with just setting the resendTimeout through the constructor is that the queue's resendTimeout then currently won't get set to the value set by the WithTimeout option (though that's easily fixed, of course).

Though I've fixed that issue in #88 now, with the addition of the TimeoutManager. If you agree, we'll let this PR still pass the timeout in the different function calls to the queue, and then remove that in #88 instead?

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-10-client-side-packeting-bug-fix branch 7 times, most recently from f449c3a to a351b99 Compare November 17, 2023 18:58
// As we can't be sure that waitForSync cancel listener was
// triggered before this one, we send over the cancel channel
// again, to make sure that both listeners are triggered.
c.reset()
Contributor Author


We could potentially cover this instead by closing the cancel channel when executing the reset function, and creating a new channel on every sync init (and also using a sync.Once to ensure we only close the channel once per sync). But I figured this version is probably a bit cleaner.

Member


sync.Once to ensure we only cancel the channel once per sync

But then you would need to re-init the sync.Once each time too, which kinda defeats the purpose, right?

I think the state machine approach is a bit nicer than closing & recreating channels.

If you want to make sure that the reset only does something if we actually need it to, then, in the reset method, you could check that we are in the "Resending" state; if we are not, it becomes a no-op. That way, the state machine becomes a bit safer. We could add the same type of check anywhere else where we change the state.

Contributor Author


But then you would need to re-init the sync.Once each time too which kinda defeats the purpose right?

Yes exactly, which is why I opted for this solution :)!

If you want to make sure that the reset only does something if we actually need it to, then, in the reset method, you could check that we are in the "Resending" state; if we are not, it becomes a no-op. That way, the state machine becomes a bit safer. We could add the same type of check anywhere else where we change the state.

Hmmm, I'm not sure how this would remove the need to call c.reset() both here and at the end of the waitForSync function, as we can't be sure which of the select cases - here or in waitForSync - will catch the <-c.cancel case first. We need to ensure that both trigger when the c.cancel channel gets sent over. Is this a case where you'd use a channel with a size of 2, or is there a better way to solve this?

Member


It sounds to me like this could instead be handled by a sync.Cond. But IMO that is harder to understand (and also not too easy to use in the first place). So I'll leave it to you whether you want to refactor or leave it the way it is now.

@ellemouton

ellemouton commented Dec 13, 2023

I've now added it to this PR, after adjusting it slightly to cover some edge cases 🚀!

Awesome!

Let me know if you'd like me to remove the initial resending loop protection commits, or if you think it makes sense to keep this refactor as a final separate commit.

Yeah - ideal PR structure would be: 1) add the new syncer + test, then 2) inject it & use it in queue. Otherwise we sorta have to review the logic twice. (Don't worry about keeping me as commit author for the syncer refactor - it was just putting your code in another file anyways 😊)


@ellemouton ellemouton left a comment


Leaving some comments :)
I think your idea of first adding the syncer & then injecting the usage is a good one and will make reviewing easier since we then only need to review the logic once vs 2 versions of similar logic.

One thing that concerns me is that it seems (from looking at the results of some of the latest GH action runs) that with this PR, the runtime of the reconnection tests has strictly increased and so is resulting in more timeouts.

before PR

    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:false,grpcClientConn:false) (9.82s)
    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:false,grpcClientConn:true) (14.81s)
    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:true,grpcClientConn:false) (12.78s)
    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:true,grpcClientConn:true) (16.00s)
    --- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:false,grpcClientConn:false) (11.80s)
    --- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:false,grpcClientConn:true) (19.80s)
    --- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:true,grpcClientConn:false) (21.29s)
    --- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:true,grpcClientConn:true) (19.86s)


    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:false,grpcClientConn:false) (9.82s)
    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:false,grpcClientConn:true) (13.81s)
    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:true,grpcClientConn:false) (10.95s)
    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:true,grpcClientConn:true) (14.66s)
    --- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:false,grpcClientConn:false) (15.82s)
    --- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:false,grpcClientConn:true) (19.80s)
    --- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:true,grpcClientConn:false) (22.18s)
    --- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:true,grpcClientConn:true) (18.52s)

    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:false,grpcClientConn:false) (9.82s)
    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:false,grpcClientConn:true) (13.84s)
    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:true,grpcClientConn:false) (12.71s)
    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:true,grpcClientConn:true) (15.70s)
    --- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:false,grpcClientConn:false) (21.82s)
    --- FAIL: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:false,grpcClientConn:true) (33.75s)


with PR

    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:false,grpcClientConn:false) (9.82s)
    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:false,grpcClientConn:true) (28.67s)
    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:true,grpcClientConn:false) (24.85s)
    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:true,grpcClientConn:true) (16.08s)
    --- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:false,grpcClientConn:false) (27.85s)
    --- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:false,grpcClientConn:true) (41.81s)
    --- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:true,grpcClientConn:false) (23.67s)
    --- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:true,grpcClientConn:true) (55.31s)

    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:false,grpcClientConn:false) (29.89s)
    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:false,grpcClientConn:true) (29.85s)
    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:true,grpcClientConn:false) (12.09s)
    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:true,grpcClientConn:true) (29.47s)
    --- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:false,grpcClientConn:false) (38.83s)
    --- FAIL: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:false,grpcClientConn:true) (35.50s)

    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:false,grpcClientConn:false) (14.83s)
    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:false,grpcClientConn:true) (33.87s)
    --- FAIL: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:true,grpcClientConn:false) (33.06s)
    --- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:false,grpcClientConn:false) (31.83s)
    --- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:false,grpcClientConn:true) (39.80s)
    --- FAIL: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:true,grpcClientConn:false) (32.93s)

Is this perhaps a necessary/expected increase due to the fact that we now back-off a bit?

gbn/queue.go Outdated
@@ -101,8 +118,6 @@ func (q *queue) resend(cb func(packet *PacketData) error) error {
return nil
}

q.lastResend = time.Now()
Member


where are we setting it now though?

Contributor Author


Ah good catch, my bad!

gbn/queue.go Outdated
@@ -112,36 +127,48 @@ func (q *queue) resend(cb func(packet *PacketData) error) error {
q.topMtx.RUnlock()

if base == top {
q.syncer.reset()
Member


why do we need a reset here?

Initially in my diff I had it here cause there was an init state above this. But if that is removed, then this can be removed too, right?

Member


btw - the reason I had the other "init" state here was to remove the need for locking the awaitMu mutex before doing the seqBase and seqTop locks.

If we still want there to be no other "resend" call during the time this is going, then we should add that back.

But I think this is synchronous for now, right?

Contributor Author


why do we need a reset here

Ah my bad, missed removing that. Thanks!

btw - the reason i had the other "init" state here was to remove the need for locking the awaitMu mutex before doing the seqBase and seqTop locks.

Yeah, it's synchronous currently, so we will never call the queue.resend function twice at the same time while it's resending.
Since we now process the awaited ACK + NACK while we are also resending packets, we need to ensure that we set syncStateResending under the same lock that we use to set the awaitedACK + awaitedNACK, which is why I changed this. But like you say, if we add an extra init state to the syncer, we could ensure that two concurrent calls of the queue.resend function aren't allowed. I opted not to add it now though, to avoid extra confusion from unused functionality, but I can change it if you'd prefer :).

gbn/queue.go Outdated
Comment on lines 429 to 247
	if seq == q.sequenceTop {
		q.sequenceBase = q.sequenceTop
	}

	// If we receive the awaited NACK, we shouldn't trigger a new
	// resend, as we can now proceed to send new packets.
	return false, false
}

// If the NACK is the same as sequenceTop, and we weren't awaiting this
// NACK as part of the resend catch up, it probably means that queue
// was sent successfully, but we just missed the necessary ACKs. So we
-// can empty the queue here by bumping the base and we dont need to
+// can empty the queue here by bumping the base and we don't need to
// trigger a resend.
if seq == q.sequenceTop {
	q.sequenceBase = q.sequenceTop
-	return true, false
+
+	return false, false
}

// Is the NACKed sequence even in our queue?
if !containsSequence(q.sequenceBase, q.sequenceTop, seq) {
q.cfg.log.Tracef("NACK seq %d is not in the queue. Ignoring.",
Member


it seems the "bumped" logic is changed here only to be changed again in a later commit...

If there is a bug in the original logic - perhaps better to fix that in its own commit before any additions, cause these 2 diffs are a bit confusing to follow.

Contributor Author


Yeah good idea, cleaned this up with the latest push :)!

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-10-client-side-packeting-bug-fix branch from a1ee556 to b101a16 Compare December 16, 2023 02:41
@ViktorTigerstrom

Thanks a lot for the review @ellemouton! I've updated the commit history quite extensively with the latest push, and addressed your feedback, so I hope this version looks better :)! 🚀

Is this perhaps a necessary/expected increase due to the fact that we now back-off a bit?

Yeah, this was part of why I added the IsPingPacket logic earlier in the PR that we later remove, as these reconnect tests now often trigger a resend sync with a ping packet, which will time out (i.e. await resendTimeout*3).
So it is expected, but a bit unfortunate. I'll experiment a bit with different timeouts later when addressing #89 - or are the timeouts something you'd like me to update in this PR :)?


@guggero guggero left a comment


Code looks good to me, great work!
But it seems like the server or client disconnect integration tests fail pretty consistently. Do you think that's something that needs to be addressed with different timeouts (e.g. #88) or should they pass reliably for this PR too?

I'm happy to merge this as-is if one of the PRs down the road shows them passing reliably.


@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-10-client-side-packeting-bug-fix branch from b101a16 to 82af2e2 Compare January 3, 2024 00:58
@lightninglabs-deploy

@ellemouton: review reminder

@ellemouton

I think the tests are timing out since we now don't just keep resending, right? The test that times out is the reconnection test - meaning that packets are getting dropped. So before, we would resend more frequently & now we back off quite a bit.

So I think if we are confident that this PR is an improvement in reliability, then we should just increase the itest timeout to whatever the average time required by that test is. Looks like 1 min should be fine, right?

In order to ensure reliability, I suggest running a simulation overnight that keeps sending packets back and forth and occasionally restarts the aperture server - the test is successful if by the end of the test (after 12 hours or w/e) the client and server are still in sync & no resend loops are occurring.

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-10-client-side-packeting-bug-fix branch from 82af2e2 to c95b563 Compare January 4, 2024 02:35
@ViktorTigerstrom

Ok, so I've finally figured out what's going on in the itests! There were 2 core problems: one caused by a bug in this PR, and one pre-existing:

Issue 1:

The bug caused by this PR mainly affected the testClientReconnect test. It was caused by me missing that calling g.resendTicker.Reset(g.cfg.resendTimeout) in the resendQueue func in GoBackNConn.sendPacketsForever doesn't also drain the ticker's channel if a tick occurred before resetting the timer.
That caused the resendQueue func to be called over and over again whenever the queue syncer timed out while the client was down and never sending any ACKs/NACKs back, as the ticker always ticks when the syncer times out.

I've addressed this with the latest push!

Issue 2:

The problem is pre-existing, and is not made worse by this PR. It affects the testServerReconnect test, and is what's happening in #89.

So the core issue is that when we stop the server here, the server sends a FIN packet, and then closes the connection:

We then verify that the client's connection status has changed to mailbox.ClientStatusSessionNotFound on the line below.

But as there's a sliiiight delay between when the FIN packet is sent and when the connection is closed, 25-50% of the time when running the test, the client receives the FIN packet from the server and calls recvFromStream in GoBackNConn.receivePacketsForever before the connection has been closed, and the client will then never move the connection status to mailbox.ClientStatusSessionNotFound.

When the client then processes the FIN packet, the client will close down and never change the connection status.

Therefore, the predicate in the testServerReconnect test that verifies that the connection status changes to mailbox.ClientStatusSessionNotFound within the timeout will never be satisfied, no matter how long the timeout is set to.

For the cases when the test actually succeeds, the connection is already closed when the recvFromStream call is made, and the status is changed as per the implementation here (as c.transport.Recv(ctx) will error):

c.setStatus(errStatus)

I'm not 100% sure how we should handle this correctly, so I'd appreciate any feedback on how you think this should best be handled! One potential solution that I experimented with would be to trigger the mailbox client_conn to re-check the connection status once the client's gbn_conn processes a FIN packet in the receive function. That will cause the client's connection status to change to ClientStatusNotConnected after the connection has been closed, so given that we change the test verification predicate to also accept that status, the test will pass in this case. Does that sound like a reasonable solution, or do you have a better suggestion?


Finally, it's expected that the itests take a bit longer for the reconnection tests, as we now don't resend as frequently and back off quite a bit, as @ellemouton points out above. However, I actually don't think we need to increase any timeouts; it's mainly the 2 issues I've detailed above that have been causing problems for the itests 🚀!

In order to ensure reliability, I suggest running a simulation overnight that keeps sending packets back and forth and occasionally restarts the aperture server - the test is successful if by the end of the test (after 12 hours or w/e) the client and server are still in sync & no resend loops are occurring.

This is a great idea, I'll implement that simulation asap and let it run for a while as suggested 🔥!

@guggero guggero self-requested a review January 4, 2024 08:19
@ellemouton

Really nice investigation @ViktorTigerstrom 🔥 🔥

regarding how to solve for issue 2, could an onFIN callback work? see:
onFin.patch


@ViktorTigerstrom ViktorTigerstrom left a comment


Thanks a lot for the patch @ellemouton! That worked really well 🔥🚀!
I've added it on top of this PR, so itests are now finally consistently passing 🎉!

gbn.WithKeepalivePing(
gbnClientPingTimeout, gbnPongTimeout,
),
gbn.WithOnFIN(func() {
Contributor Author


We could potentially add this to the server_conn.go version as well, and force the server to set the connection status to ServerStatusIdle too. But I haven't really seen any case where this has caused issues on the server side yet, so I opted not to add it. I will add it to the server side as well if you think I should!


@guggero guggero left a comment


Awesome investigation on the itest flake hunt! Very nice to have things pass so consistently now.

Great work, LGTM 🎉


@ellemouton ellemouton left a comment


tACK, LGTM! Great investigations & great work!! 💥 🚀

The majority of my comments are spelling/grammar nits.

Review threads (all resolved):
gbn/queue.go (outdated)
gbn/syncer.go (3× outdated, 1 current)
gbn/syncer_test.go (4× outdated)
gbn/gbn_conn.go
@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-10-client-side-packeting-bug-fix branch from 4998324 to 1e9d938 Compare January 5, 2024 13:41
@ViktorTigerstrom
Contributor Author

Thanks a lot for the review @ellemouton!
Left some comments, mainly on #87 (comment) & #87 (comment). Once those have been resolved, I think this should be good to go!

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-10-client-side-packeting-bug-fix branch from 1e9d938 to f41a037 Compare January 8, 2024 10:30
The processNACK function returns two booleans. The first boolean
previously indicated only whether the NACK was part of the queue; this
commit expands it to indicate whether the queue needs to be resent as a
result of the NACK.

The second boolean indicates whether the queue's sequence base was
bumped. Previously, the bumped value returned by processNACK didn't
accurately reflect whether the sequence base was actually bumped in the
cases where we didn't trigger a resend. This commit addresses that
issue. Correcting the bumped value has no real impact on the current
implementation of the gbn conn, but it might start having an effect if
the implementation is modified in the future.
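The two-boolean contract can be sketched as below. This is an illustrative stand-in, not the actual lightning-node-connect gbn API: the queue layout, field names, and sequence handling are simplified assumptions.

```go
package main

import "fmt"

// queue is a minimal stand-in for the gbn send queue.
type queue struct {
	sequenceBase uint8 // seq of the oldest un-ACKed packet
	sequenceTop  uint8 // seq one past the newest queued packet
}

// contains reports whether seq falls inside [sequenceBase, sequenceTop),
// accounting for wrap-around of the cyclic sequence space.
func (q *queue) contains(seq uint8) bool {
	if q.sequenceBase <= q.sequenceTop {
		return seq >= q.sequenceBase && seq < q.sequenceTop
	}
	return seq >= q.sequenceBase || seq < q.sequenceTop
}

// processNACK mimics the two-boolean contract described above: the
// first result says whether the queue should be resent, the second
// whether the sequence base was bumped.
func (q *queue) processNACK(seq uint8) (shouldResend, bumped bool) {
	switch {
	case seq == q.sequenceTop:
		// The peer expects the packet right after everything we
		// sent: nothing to resend, but every queued packet is
		// implicitly ACKed, so the base may still be bumped.
		bumped = q.sequenceBase != seq
		q.sequenceBase = seq
		return false, bumped

	case q.contains(seq):
		// NACK for a packet still in the queue: everything
		// before it is implicitly ACKed, bump the base to it
		// and signal a resend.
		bumped = q.sequenceBase != seq
		q.sequenceBase = seq
		return true, bumped

	default:
		// Stale NACK outside the queue: ignore it.
		return false, false
	}
}

func main() {
	q := &queue{sequenceBase: 1, sequenceTop: 5} // packets 1..4 in flight

	resend, bumped := q.processNACK(3)
	fmt.Println(resend, bumped) // prints "true true"
}
```

Note how a NACK equal to sequenceTop can bump the base without triggering a resend — exactly the case where the old code reported a misleading bumped value.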
The syncer struct can be used to ensure that both the sender and the
receiver are in sync before the waitForSync function completes. This
ensures that we don't end up in a loop where one party keeps resending
packets over and over again. This is done by waiting until we receive
either an expected ACK or an expected NACK after resending the queue.
The expected ACK is the ACK for the last packet that has been resent,
and the expected NACK is the NACK for the sequence following the last
packet that has been resent.

To understand why we need to await the expected ACK/NACK after resending
the queue, consider the following scenario:

1. Alice sends packets 1, 2, 3 & 4 to Bob.

2. Bob receives packets 1, 2, 3 & 4, and sends back the respective ACKs.

3. Alice receives the ACKs for packets 1 & 2, but due to latency the
ACKs for packets 3 & 4 are delayed and aren't received until Alice's
resend timeout has passed, which leads to Alice resending packets 3 & 4.
Alice will then receive the delayed ACKs for packets 3 & 4, but will
consider them to be the ACKs for the resent packets, not for the
original packets they were actually sent for. If we didn't wait after
resending the queue, Alice would then proceed to send more packets
(5 & 6).

4. When Bob receives the resent packets 3 & 4, Bob will respond with
NACK 5. Due to latency, the packets 5 & 6 that Alice sent in step (3)
above will then be received by Bob, and be processed as the correct
response to the NACK 5. Bob will then await packet 7.

5. Alice will receive the NACK 5, and now resend packets 5 & 6. But as
Bob is now awaiting packet 7, this send will lead to a NACK 7. Due to
latency, if Alice doesn't wait after resending the queue, she will
proceed to send new packet(s) before receiving the NACK 7.

6. This resend loop would continue indefinitely, so we need to ensure
that Alice waits after she has resent the queue, to ensure that she
doesn't proceed to send new packets before she is sure that both
parties are in sync.

To ensure that we are in sync after we have resent the queue, we wait
until we either:
1. Receive a NACK for the sequence number succeeding the last packet in
the resent queue, i.e. in step (3) above, that would be NACK 5.
OR
2. Receive an ACK for the last packet in the resent queue, i.e. in step
(3) above, that would be ACK 4. After we receive the expected ACK, we
then wait for the duration of the resend timeout before continuing.
The reason we wait for the resend timeout before continuing is that the
ACKs we get after a resend could be delayed ACKs for the original
packets we sent, and not ACKs for the resent packets. In step (3)
above, the ACKs for packets 3 & 4 that Alice received were delayed ACKs
for the original packets. Had Alice immediately continued to send new
packets (5 & 6) after receiving the ACK 4, she would then have received
the NACK 5 from Bob, which was the actual response to the resent queue.
But as Alice had already sent packets 5 & 6 when receiving the NACK 5,
resending the queue in response to that NACK would cause the resend
loop to continue indefinitely.

When either of the 2 conditions above is met, we consider both parties
to be in sync, and we can proceed to send new packets.
When the client receives and processes a FIN packet, we force the
connection to status ClientStatusSessionNotFound, as on rare occasions
the corresponding server may have had time to close the connection
before we've processed the FIN packet sent by the server. In that case,
if we didn't force a new status, the client would never mark the
connection as ClientStatusSessionNotFound.

This solves a bug in the testServerReconnect itest, as that test would
sometimes fail due to the FIN packet being processed without the
connection status changing, leading to a test predicate never being
satisfied.
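The onFIN hook follows the functional-option pattern the gbn package already uses for options like WithKeepalivePing. A minimal sketch of that pattern, with illustrative names (config, conn, handleFIN, and the status strings are assumptions, not the real package internals):

```go
package main

import "fmt"

// config holds the optional hooks a connection can be created with.
type config struct {
	onFIN func()
}

// Option mutates the config; this mirrors the functional-option style.
type Option func(*config)

// WithOnFIN registers a callback invoked once a FIN packet has been
// received and processed.
func WithOnFIN(cb func()) Option {
	return func(c *config) { c.onFIN = cb }
}

type conn struct {
	cfg config
}

func newConn(opts ...Option) *conn {
	c := &conn{}
	for _, o := range opts {
		o(&c.cfg)
	}
	return c
}

// handleFIN is where the client would force the connection status, so
// a racing server-side close can't leave the status unchanged.
func (c *conn) handleFIN() {
	if c.cfg.onFIN != nil {
		c.cfg.onFIN()
	}
}

func main() {
	status := "ClientStatusConnected"
	c := newConn(WithOnFIN(func() {
		status = "ClientStatusSessionNotFound"
	}))
	c.handleFIN()
	fmt.Println(status) // prints "ClientStatusSessionNotFound"
}
```

The caller, not the transport, decides what a FIN means for connection state — which is why the client can force ClientStatusSessionNotFound here while the server side can stay untouched until it demonstrably needs the same treatment.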
@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-10-client-side-packeting-bug-fix branch from f41a037 to 5e16121 Compare January 8, 2024 10:33
@ellemouton ellemouton merged commit 445d935 into lightninglabs:master Jan 8, 2024
5 checks passed
Successfully merging this pull request may close these issues.

itest: CI often fails with "predicate not satisfied after time out" error
4 participants